A Problem-Specific Fault-Tolerance Mechanism for Asynchronous, Distributed Systems
Authors
Abstract
The idle computers on a local area, campus area, or even wide area network represent a significant computational resource, one that is, however, also unreliable, heterogeneous, and opportunistic. We describe an algorithm that allows branch-and-bound problems to be solved in such environments. In designing this algorithm, we faced two challenges: (1) scalability, to effectively exploit the variably sized pools of resources available, and (2) fault tolerance, to ensure the reliability of services. We achieve scalability through a fully decentralized algorithm, in which the dynamically available resources are managed through a membership protocol. We guarantee fault tolerance in the sense that the loss of up to all but one resource will not affect the quality of the solution. For propagating information reliably, we use epidemic communication for both the membership protocol and the fault-tolerance mechanism. We have developed a simulation framework that allows us to evaluate design alternatives. Results obtained in this framework suggest that our techniques can execute scalably and reli-
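The abstract names epidemic communication as the means of propagating information but gives no details of the protocol. As a rough illustration only, the Python sketch below simulates a generic push-style gossip in which every node repeatedly sends its best-known bound to a few randomly chosen peers, and receivers keep the minimum. The function name gossip_best_bound, the fanout parameter, the round-based timing, and the choice to spread a minimum bound are all assumptions made for this example and are not taken from the paper.

```python
import random


def gossip_best_bound(bounds, fanout=2, seed=0, max_rounds=1000):
    """Spread the smallest bound to all nodes by push-style gossip.

    bounds: non-empty list with one initial best-known bound per node.
    fanout, seed, max_rounds: illustrative assumptions, not values
    from the paper.
    Returns (final per-node bounds, number of rounds executed).
    """
    rng = random.Random(seed)
    state = list(bounds)
    n = len(state)
    target = min(state)  # the value every node should eventually hold
    rounds = 0
    while any(b != target for b in state) and rounds < max_rounds:
        # Messages carry the values nodes held at the start of the round.
        snapshot = list(state)
        for node in range(n):
            peers = [p for p in range(n) if p != node]
            # Push to `fanout` distinct peers chosen uniformly at random.
            for peer in rng.sample(peers, min(fanout, len(peers))):
                state[peer] = min(state[peer], snapshot[node])
        rounds += 1
    return state, rounds


if __name__ == "__main__":
    final, rounds = gossip_best_bound([50, 42, 77, 61, 90, 42, 58, 73])
    print(final, rounds)
```

Because each node only needs to reach a few random peers per round, no single node is a point of failure for the dissemination, which is the property that makes this style of communication a natural fit for unreliable pools of machines.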
Similar resources
A Case Study of Agreement Problems in Distributed Systems: Non-Blocking Atomic Commitment
This paper considers an agreement problem whose practical interest is well known, namely the Non-Blocking Atomic Commitment Problem. First, a generic protocol solving this problem is given and then instantiations of its generic statements are provided for both synchronous and asynchronous distributed systems. These instantiations use a few basic components: timeout mechanism and reliable multic...
Consensus in Asynchronous Distributed Systems
The distributed consensus problem arises when several processes need to reach a common decision despite failures. The importance of this problem is due to its omnipresence in distributed computation: we need consensus to implement reliable communications, atomic commitment, consistency checks, resource allocation, etc. The solvability of this problem is strictly related to the nature of the sy...
Revisiting the Non-Blocking Atomic Commitment Problem in Distributed Systems
Agreement problems allow a set of processes to agree on a common output value. These problems are of primary importance in distributed systems and difficult to solve in presence of failures. This paper considers one of these problems whose practical interest is well known, namely the Non-Blocking Atomic Commitment Problem. First, a generic protocol solving this problem is given and then instantia...
Fault-Tolerant Distributed Systems: a Modular Approach to the Non-Blocking Atomic Commitment Problem
Agreement problems allow a set of processes to agree on a common output value. These problems are of primary importance in distributed systems and difficult to solve in presence of failures. This paper considers one of these problems whose practical interest is well known, namely the Non-Blocking Atomic Commitment Problem. First, a generic protocol solving this problem is given and then instantiat...
Solving Consensus in a Byzantine Environment Using an Unreliable Fault Detector
Unreliable fault detectors can be used to solve the consensus problem in asynchronous distributed systems that are subject to crash faults. We extend this result to asynchronous distributed systems that are subject to Byzantine faults. We define the class ◇S(Byz) of eventually strong Byzantine fault detectors and the class ◇W(Byz) of eventually weak Byzantine fault detectors and show that any B...
Journal title:
Volume, issue:
Pages: -
Publication date: 2000